Lab 02: Visual data analysis

Introduction

This week's lab provides a practical introduction to visual data analysis. At the end of the lab, you should be able to use pandas to:

  • Import CSV data with multiple indices.
  • Make a bar chart.
  • Make a pie chart.
  • Make a histogram.
  • Make a scatter plot matrix.

As you will see later, pandas provides a lot of different options for plotting data and we only have time to look at a few in detail. If you're interested, you can find more information on plotting here.

The Iris flower data set

This week, we'll be using data from the Iris flower data set, a well-known data set in the world of data analysis. The data set consists of 150 samples taken from three different species of Iris (Iris setosa, Iris virginica and Iris versicolor; 50 samples each). In each case, the length and width (in centimeters) of the Iris' sepals and petals were measured. The purpose of the lab is to perform some exploratory visual analysis on the data to see what we can learn.

Getting started

Let's start by making sure that plots are displayed inline by issuing the magic command %matplotlib inline and importing pandas in the usual way:


In [ ]:
%matplotlib inline
import pandas as pd

Next, let's load the data. Write the path to your iris.csv file in the cell below:


In [ ]:
path_to_csv = "data/iris.csv"

If you open the CSV file manually (e.g. using spreadsheet software), you will see that it contains six columns: four describe the measurements of the data (i.e. the sepal and petal widths and lengths), while the remaining two describe the species of the data (i.e. Iris setosa, Iris virginica or Iris versicolor) and the index number of the sample, which ranges from 1 to 50 for each species.

If we index the measurement data using both the species and sample number information, we can refer to individual measurements per species. Fortunately, with pandas, this is easily done using the same read_csv command we used last week. The only difference this time is that we must pass a list of column names (in order) to use as the index.

Note: Last week, we passed index_col='year', so that we could index the baby name information in the data frame by year. This week, we'll use the same index_col argument, but instead pass a list of strings to use as indices, i.e. index_col=['index_1', 'index_2', ..., 'index_n'].

Execute the cell below to load the data into a pandas data frame and index that data frame by the species and sample_number columns:


In [ ]:
df = pd.read_csv(path_to_csv, index_col=['species', 'sample_number'])

Let's take a closer look at the data we've just loaded. We can start by using the head method to take a peak at the first few rows:


In [ ]:
df.head()

As you can see, we now have two index columns on the left side of the data frame: species and sample_number.

Exploratory data analysis

Summary statistics

Let's pick up where we left off in the last lab and compute some summary statistics. We can do this the same way we did last week, using the describe method, like this:


In [ ]:
df.describe()

This is good, but what if we want to see data relating to a specific species? This is where our index helps. Because we indexed the data frame using both species and sample numbers, we can reference all the data for a single species by indexing into the data frame using just that species name. For instance, to compute summary statistics for Iris setosa alone, we can write:


In [ ]:
df.loc['setosa'].describe()

As you can see, the summary for Iris setosa counts just fifty values, rather than the 150 values in the summary of the entire data frame.

What about comparing statistics across different species? Again, with pandas, this is easy. Last week, we saw that pandas has several methods for computing statistics for data frames, e.g. mean, median, min, max and std. Each of these methods accepts an optional level argument, which refers to the level of the index you want to compute the function on. For instance, to compute the mean of each sepal and petal length and width for each species, we can write:


In [ ]:
df.mean(level='species')

Here, we've set level='species' because we have two index columns: species and sample_number. If the level argument is not specified, calling mean on a data frame has the effect of computing the mean of each column across all indices:


In [ ]:
df.mean()

So why bother learning methods like mean, median and std when we can just call describe to get a complete summary? One reason is because the describe method does not accept a level argument and so calls to it will compute statistics for all of the data in the data frame, unless we index into the data frame first with a specific index reference (like we did earlier when computing a summary for Iris setosa).

Another reason is that calling these specific instance methods allows us to make further calls to pandas built-in visualisation methods. Let's take a look at some of these in more detail next.

Visual analysis

Manually examining tables of numerical values can be an exhausting experience. For instance, when we computed the mean sepal and petal widths and lengths for each species earlier, we ended up with a table of numbers, which doesn't provide much immediate insight into how these quantities vary over species without us having to manually compare the contents of each cell. In cases like this, using a visual technique can be a much better option because it gives us an immediate intuitive sense of what's going on. Let's take a look at a few commonly used techniques next.

Bar charts

Most of pandas plotting functionality is contained in the plot method of the data frame, which is itself a wrapper for the matplotlib plotting library. To compute a bar chart of some data, all we need to do is call the plot method on it with the optional argument kind='bar' to specify that we want a bar chart.

Note: If kind is not specified, then plot defaults to a line plot.

For instance, to create a bar chart of the mean value of each column in our data frame, all we need to do is call mean on the data frame and then call plot on the output of our call to mean (remembering to set kind='bar' too!), like this:


In [ ]:
df.mean().plot(kind='bar');

But what if we want to visualise how these quantities vary across species? Easy! All we need to do is pass level='species' to the mean method, like earlier:


In [ ]:
df.mean(level='species').plot(kind='bar');

We can also produce stacked bar charts, by passing the optional argument stacked=True, like in the cell below:

Note: By default, if it is not specified, stacked=False.


In [ ]:
df.mean(level=0).plot(kind='bar', stacked=True);

We can also make the bar chart horizontal rather than vertical by setting kind='barh', like this:


In [ ]:
df.mean(level=0).plot(kind='barh');

Note: pandas provides lots of options for plotting. You can find a comprehensive list (with sample code) here.

We now have a visual representation of the mean values of each sepal and petal measurement across each species. We don't need to worry about creating a chart legend, choosing bar colours or labelling our $x$ axis - pandas (via matplotlib) does this all automatically.

Pie charts

At the start of this lab, the Iris data set was described as containing 150 samples in total, consisting of 50 from each species. Up until now, we haven't really questioned whether this is true, although as good data analysts we should have! Let's rectify our oversight now.

As it turns out, counting things in pandas is a pretty trivial task. All we need to do is call the count method of the data frame and we get a summary of the number of data points in each column:

Note: If our columns have missing data (the Iris data set doesn't), then count will return the number of non-missing (i.e. valid) data points. We'll see this in a bit more detail next week.


In [ ]:
df.count()

Like the mean method, count accepts a level argument, so we can tell pandas to count the number of non-missing data points each species for each species like this:


In [ ]:
df.count(level='species')

It's a trivial example because each of the species has an equal number of samples, but why don't we use a pie chart to visualise this data?

Note: In many cases, sample categories are not evenly distributed, so examining the proportion of samples in each category is a valuble habit to form.

Creating pie charts with pandas works in a similar way to creating bar charts: we call plot on the data we want to visualise, passing kind='pie' instead of kind='bar' to specify the output chart type. There is one small difference though: pie charts cannot represent more than one column of data at a time, so we must tell pandas to make individual plots for each column by passing the optional argument subplots=True, as in the cell below:

Note: While pie charts are a good technique for visualising proportions, they are a poor technique for visualising almost anything else!


In [ ]:
df.count(level='species').plot(kind='pie', subplots=True);

The pie charts look a bit squashed, right? We can fix this by passing the optional argument figsize=(width,length), which adjusts the size of the chart output. For instance, to set the figure size to 16x4, we can write:


In [ ]:
df.count(level='species').plot(kind='pie', subplots=True, figsize=(16,4));

Note: Setting the figure size like this works for any call to pandas' plot method. Try adjusting the size of one the bar charts we made earlier for practice!

This is better, but the species labels in the chart are overlapping with the column names and the legend, which makes them harder to read. We can fix this by passing adjusting the rotation of the charts with the startangle argument, as follows.


In [ ]:
df.count(level='species').plot(kind='pie', subplots=True, figsize=(16,4), startangle=90);

We can fix this by reorganising the plots from a 4x1 grid into a 2x2 grid using the optional layout argument. To do this, all we need to do is pass layout=(2,2) to the plot command (also setting figsize=(12,12) to compensate for the change in size), like this:


In [ ]:
df.count(level=0).plot(kind='pie', subplots=True, figsize=(12,12), startangle=90, layout=(2,2));

Histograms

In lectures, we've seen how histograms can be used to visualise the distribution of the data in a sample. Again, using pandas, this is easily done using the plot method. The only difference this time is that we must set kind='hist'. For instance, to plot histograms for each column in the data frame, we can write:


In [ ]:
df.plot(kind='hist');

If we want to plot data for just one species of Iris, then we need to index into the data frame first (like we did with the describe method earlier), like this:


In [ ]:
df.loc['setosa'].plot(kind='hist');

Like with pie charts, we can force pandas to create a plot for each variable type by specifying the options subplots=True and layout=(2,2) (the figsize option is also used to make sure the plots are big enough to see), like this:


In [ ]:
df.loc['setosa'].plot(kind='hist', subplots=True, layout=(2,2), figsize=(12,6));

Next, we can increase the number of bins to get some finer detail about the distributions using the bins argument, like this:


In [ ]:
df.loc['setosa'].plot(kind='hist', subplots=True, layout=(2,2), figsize=(12,6), bins=30);

Try varying the number of bins to see the effect it has on the disributions. Remember, if the number of bins is too large, you'll start to get a "broken comb" look.

Scatter plot matrices

Creating histograms for a single species is informative, but we're missing out on the bigger picture. One of the most important aspects of exploratory data analysis is determining whether there are any relationships in your data and one of the best techniques for visualising this is the scatter plot matrix.

Before we get to building a matrix for our data, let's consider the quantitative approach first: computing correlations for each pair of variables in our data set. With pandas, this is easy - all we need to do is call the corr method on our data frame, like in the cell below, and pandas computes the Pearson correlation coefficient for each pair of variables in our data set and presents the data as a table.


In [ ]:
df.corr()

As you can see from the data, petal length is highly positively correlated to sepal length.

Note: By definition, a data sample is always completely positively correlated to itself (i.e. $r_{xx} = 1$). This is why the diagonal entries in the table above are all equal to one.

Next, let's consider the qualitative approach: in pandas, we can make a scatter plot from a data frame by passing it to pandas' scatter_matrix function, like this:

Note: Unlike the other methods we've covered today, scatter_matrix is not an option we can pass to plot, i.e. we cannot call df.plot(kind='scatter_matrix'). Instead, we must always pass the data frame to the method as in the cell below.


In [ ]:
pd.plotting.scatter_matrix(df, figsize=(12, 12));

This is good, but the points in our scatter plot are a little small. We can change this by specifying the optional s argument (s stands for size), like this:


In [ ]:
pd.plotting.scatter_matrix(df, figsize=(12, 12), s=100);

This is much clearer! We can now easily spot correlations visually, e.g. petal width and petal length, which will help us form hypotheses about what the final outcome of our analysis might be. For example, in this case, we might conclude that species with larger petal widths have larger petal lengths.